This course covers the standard analysis of a single-cell RNA-seq dataset using the Seurat pipeline. It follows to some extent the standard tutorial from Seurat, which we encourage the readers to consult when they want to dive more into single-cell data analyis.
In this course, we will use an example data set consisting of 7,000 cells from the Fly Cell Atlas repository. These cells all come from the Body part, which was dissociated as a whole and sequenced using 10x Genomics technology over 10 batches, mixing males and females.
This set of 7,000 cells was carefully selected/filtered from the original Body dataset that consists of 96,926 cells. They belong to a certain number of differently annotated cell types.
The goal of this exercise session is to:
First, you need to download the dataset from the GitHub repository: data.ENHPATHY.drosophila.melanogaster.txt.gz and store it on your computer, in a path that you will remember for the next script.
Note: You don’t need to unzip it for the script to run. But it’s a good practice in general to check a file (format, content, headers) before trying to load it in R
Here, if you inspect the file, you should see something like that:
Cell_1 Cell_2 Cell_3 Cell_4 Cell_5 Cell_6 Cell_7 Cell_8 Cell_9 Cell_10 ...
128up 0 0 0 0 0 0 0 0 0 0 ...
14-3-3epsilon 1 1 0 0 0 0 0 0 0 0 ...
14-3-3zeta 2 2 1 2 1 3 0 0 2 2 ...
140up 0 0 0 0 0 0 0 0 0 0 ...
18SrRNA-Psi:CR41602 0 0 0 0 0 0 0 0 0 0 ...
18w 0 0 0 0 0 0 0 1 1 0 ...
... ... ... ... ... ... ... ... ... ... ...
Question: What do you see in this file? rows? columns? headers? separation? first column? values? It’s important to know how the file is formatted, to be able to parse it correctly in the next step.
To be able to run the Seurat pipeline, you need to create a Seurat object in R. This step mainly depends on the format of the data you have, to start with. For example:
Here, the data is in text format. In this case, the first thing to do is to read your file in R and create a data.frame/matrix containing your object. Since the file is gzipped, and the file is big enough, we will use the package data.table to do so. Especially the fread() function in this package:
library(data.table) # This states that you want to load all functions in this library, includind fread()
# setwd("Path to your working directory, where you downloaded the dataset")
data.enhpathy <- fread("data.ENHPATHY.drosophila.melanogaster.txt.gz", data.table = F) Question: Check the parameters of the
freadcommand, should we add others? For e.g.sep = "\t"orheader = T?
If the fread command fails, displaying the following error message, it means that you probably did not set correctly your working directory (setwd command, see the comment in the last code chunk).
Error in fread("data.ENHPATHY.drosophila.melanogaster.txt.gz") :
File 'data.ENHPATHY.drosophila.melanogaster.txt.gz' does not exist or is non-readable. getwd()=='your_current_working_directory'
If not, then a new object called data.enhpathy should be created in the “Environment” tab at the top right of your Rstudio windows. The number of “obs. = observations” corresponds to the number of rows of your matrix, while the “variables” corresponds to the number of columns.
Note: If you click on the object name in the “Environment tab”, Rstudio will open a “View tab” to look at the content of you object. But be careful (maybe it’s not a good idea here), because if this object is too big for your RAM, your computer may not be able to open it (Rstudio crashes).
You can see some information about this object typing the following commands:
nrow(data.enhpathy)
ncol(data.enhpathy)
class(data.enhpathy)Or even peak into its content (first 8 rows/8 cols)
data.enhpathy[1:8, 1:8] V1 Cell_1 Cell_2 Cell_3 Cell_4 Cell_5 Cell_6 Cell_7
1 128up 0 0 0 0 0 0 0
2 14-3-3epsilon 1 1 0 0 0 0 0
3 14-3-3zeta 2 2 1 2 1 3 0
4 140up 0 0 0 0 0 0 0
5 18SrRNA-Psi:CR41602 0 0 0 0 0 0 0
6 18w 0 0 0 0 0 0 0
7 26-29-p 0 0 0 0 0 0 0
8 28SrRNA-Psi:CR40596 0 0 0 1 0 0 0
Note: You can see that indexing in R starts at 1, not 0. Also it seems that first column is not set up correctly.
A data.frame object has col.names and row.names metadata attached to it, in addition to the main data matrix. In our case, col.names correctly contains the cell names, but row.names are empty (well not empty, but filled by default with indexes 1,2,3,4,…) Indeed, instead of being in the metadata row.names, row names are in the first column of the data matrix, whose column name is default to V1. To solve the issue, we need to set the row names with the values in column 1, and then remove the first column of the data.frame.
row.names(data.enhpathy) <- data.enhpathy[,1] # Setting row.names as first column of the data matrix
data.enhpathy <- data.enhpathy[,-1] # Removing first column
data.enhpathy[1:8, 1:8] # Peaking at the new content of the data.frame Cell_1 Cell_2 Cell_3 Cell_4 Cell_5 Cell_6 Cell_7 Cell_8
128up 0 0 0 0 0 0 0 0
14-3-3epsilon 1 1 0 0 0 0 0 0
14-3-3zeta 2 2 1 2 1 3 0 0
140up 0 0 0 0 0 0 0 0
18SrRNA-Psi:CR41602 0 0 0 0 0 0 0 0
18w 0 0 0 0 0 0 0 1
26-29-p 0 0 0 0 0 0 0 0
28SrRNA-Psi:CR40596 0 0 0 1 0 0 0 0
Note: Now, the
data.enhpathyobject should be 12818 genes x 7000 cells
library(Seurat)
data.seurat <- CreateSeuratObject(counts = data.enhpathy, project = "ENHPATHY") # Create our Seurat object using our data matrix
data.seurat # Check the created variableAn object of class Seurat
12818 features across 7000 samples within 1 assay
Active assay: RNA (12818 features, 0 variable features)
For the rest of the course, we will work on the Seurat object. Thus we can remove the original dataset from memory (clear some space).
rm(data.enhpathy)In datasets from the Fly Cell Atlases, gene nomenclature follows FlyBase standards. In particular, mitochondrial gene names start with mt:.
rownames(data.seurat)[startsWith(rownames(data.seurat),"mt:")] [1] "mt:ATPase6" "mt:ATPase8" "mt:CoI" "mt:CoII"
[5] "mt:CoIII" "mt:Cyt-b" "mt:ND1" "mt:ND2"
[9] "mt:ND3" "mt:ND4" "mt:ND4L" "mt:ND5"
[13] "mt:ND6" "mt:lrRNA" "mt:srRNA" "mt:tRNA:Ala-TGC"
[17] "mt:tRNA:Arg-TCG" "mt:tRNA:Cys-GCA" "mt:tRNA:Ile-GAT" "mt:tRNA:Leu-TAG"
[21] "mt:tRNA:Ser-GCT" "mt:tRNA:Val-TAC"
So we will create a new metadata called percent.mt that will contain, for each cell, the percentage of reads/UMIs that map to these “mt:” genes (vs all genes)
data.seurat[["percent.mt"]] <- PercentageFeatureSet(data.seurat, pattern = "^mt:")Then, we will visualize it as a violin plot, along with two other (already pre-computed) metadata:
nFeature_RNA: Number of detected genes (>0 reads) per cellnCount_RNA: Number of reads/UMIs per cellVlnPlot(data.seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)Question: Do you see any outlier cells, or cells with lower quality?
You can display any combination of the metadata as a scatter plot.
FeatureScatter(data.seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")Question: What effect do you see in this plot? Is it expected? What is the top value (0.97)?
Now that we have a better view of potential outlier cells, we can use any combination of metadata / thresholds for filtering “bad QC” cells using the following example command line. Since thresholding is a bit arbitrary, I’ll let you decide for our dataset.
data.seurat <- subset(data.seurat, subset = nFeature_RNA > 200 & nFeature_RNA < 2500) # & ...
VlnPlot(data.seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) # Check again if it looks fine/betterQuestion: What kind of filtering would be the more appropriate in our case? Apply it by modifying the last command appropriately.
We saw that cells aggregate different number of reads. This is an issue for the next steps, as the gene expressions will not be directly comparable between two cells.
data.enhpathy <- as.data.frame(data.seurat@assays$RNA@counts[, 1:20]) # Retrieve the count matrix from the Seurat object (20 first columns)
barplot(colSums(data.enhpathy), las = 2, ylab = "# Reads / UMIs", main = "Number of reads per cell, before normalization") # 10 first cellsSo, for solving the issue, we normalize the data. By default, Seurat uses a LogNormalize method that normalize the data to some scaling factor, i.e. for each cell divide the counts by the total number of reads for that cell, and then multiply by the scale.factor. Data is then natural-log transformed with log(1 + x)
data.seurat <- NormalizeData(data.seurat, normalization.method = "LogNormalize", scale.factor = 10000)So now, if we check the sum of gene expression for each cell:
data.enhpathy <- as.data.frame(data.seurat@assays$RNA@data[, 1:20]) # Retrieve the normalized matrix from the Seurat object (20 first columns)
data.depth <- colSums(exp(data.enhpathy) - 1) # reverse the log(1 + x) to unlogged values
barplot(data.depth, las = 2, ylab = "# Reads / UMIs", main = "Sum of gene expression per cell, after normalization") # 10 first cellsQuestion: Did the normalization worked? Can you comment on the y-axis values?
We calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). These are likely to be the most interesting genes, indeed non-varying genes (even if highly expressed) do not bring any information on the heterogeneity / segregation between the cell types.
Additionally, in some steps, restricting the calculation to only the HVG allows for:
data.seurat <- FindVariableFeatures(data.seurat, selection.method = "vst", nfeatures = 2000) # Here I arbitrarily take 2000 top HVG (default)Usually, results are displayed as a scatter plot of gene expression vs gene variance. Indeed, one would expect that, because of technical bias, the lowly expressed is a gene, the more variable it is (noise).
Seurat standardize the variance across expression levels, which allows for a common thresholding regardless of expression level. Then, it takes top 2000 (by default, but can be tuned) genes by standardized variance.
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(data.seurat), 10)
# plot variable features with labels for top 10 HVGs
LabelPoints(plot = VariableFeaturePlot(data.seurat), points = top10, repel = TRUE, xnudge = 0, ynudge = 0)Seurat merged the Scaling and RegressOut function into a single function. This is a bit confusing, but it makes sense in terms of “order of the operations” => it does not allow the user to perform RegressOut on Scaled data for e.g., which would be wrong.
By default, we can simply scale the data, and not regress out any covariate. Unless we know important covariates that are present in our data, and can prevent us to see the desired biological signal. For e.g. if the data was processed in batches, it could be useful to regress out these batches, to avoid this signal to be present in our final dataset.
Here, to simply scale the data, we can run the following:
all.genes <- rownames(data.seurat) # Here I want to scale all genes. By default, only HVG genes are scaled (speed up computation)
data.seurat <- ScaleData(data.seurat, features = all.genes)Let’s take an example of a potential covariate: we have 3 batches in our dataset (for e.g. three 10x runs that were merged together, or three RNA extraction merged into one 10x run), it could be that the batch signal is stronger than the biological signal (cell types), and thus the UMAP/t-SNE/PCA plot shows 3 clusters, perfectly matching our batches, instead of separating the cells per cell type. In this case the batch covariate needs to be regressed out.
If you don’t know any potential covariate, or you know some (like sex) but you don’t believe they should be stronger than your main signal, then you can go without regressing-out anything, to dimension reduction / clustering / differential expression, and then eventually “ping-pong” back to this step, to filter any found covariate.
If we want to regress out one (or multiple) covariates, we can use the following (for e.g. here, regressing out “percent.mt”). In our case, we probably don’t need to do it.
# This step takes a bit of time to run since it needs to perform a linear regression on each gene, separately
data.seurat <- ScaleData(data.seurat, features = all.genes, vars.to.regress = "percent.mt") Note: The “regress out” step performs a simple linear regression, and return the residuals, which are then scaled
This step is crucial for the remaining of the Seurat pipeline, since many of the downstream tools will use the PCA as input (UMAP, t-SNE, clustering).
By default, the PCA is run only on the more meaningful features (HVG), both for removing noise, and for speeding up the computation.
data.seurat <- RunPCA(data.seurat)
DimPlot(data.seurat, reduction = "pca") # To visualize the results of the PCANote: Selecting the optimal number of principal components (PCs) from the PCA for downstream analysis can be tedious. One alternative is to always take 10 first PCs (or 50 first PCs), but it is very arbitrary. A more optimal solution (that we will not describe here), is to use JackStraw plots. We don’t have time to elaborate on this today, but we recommend the reader to have a look at the Seurat tutorial for optimally selecting top X PCs that are significant.
Usually, for single-cell RNA-seq datasets, PCA is not very good at showing the complexity of the data in its two first components (as we can see in the previous plot, and in general it’s far worse than this). Therefore, we are more likely to use dimension reduction techniques that are more tailored towards 2D visualization, such as t-SNE or UMAP. Both of these methods run directly on the PCA (not on the original scaled matrix).
First, I’ll run a UMAP.
# Here I arbitrarily select 10 PCs from the PCA to run the UMAP/t-SNE. See the last Note, for a better explanation on how best to select the number of PCs
data.seurat <- RunUMAP(data.seurat, dims = 1:10)
DimPlot(data.seurat, reduction = "umap") # To visualize the results of the UMAPThen, I’ll run a t-SNE.
# Here I arbitrarily select 10 PCs from the PCA to run the UMAP/t-SNE. See the last Note, for a better explanation on how best to select the number of PCs
data.seurat <- RunTSNE(data.seurat, dims = 1:10)
DimPlot(data.seurat, reduction = "tsne") # To visualize the results of the UMAPNote: I would personally rely better on the UMAP results than the t-SNE, because UMAP preserves the distance inter-clusters. But it’s open to interpretation, and different bioinformaticians will prefer one or the other (it can also depend on the dataset your are studying). Also remember that both these methods are heuristics, i.e. if no random seed is set, running it twice can generate different results.
You probably noticed that the previous plots did NOT show any clustering of the data (all cells are the same color). This is because PCA, t-SNE and UMAP are NOT clustering methods, they are dimension reduction methods. Now, we need to run a clustering method to be able to visualize predicted clusters.
This is where things get very arbitrary, because a human viewer would like to better match the clusters to the UMAP/t-SNE representation. However, clustering methods are usually run on the PCA data, which contains more information than the reduced UMAP/t-SNE and thus may not match what we see in the UMAP/t-SNE. It would be wrong, though, to run the clustering on the UMAP / t-SNE data, as these two methods do not preserve the data structure and simplify the data (only 2 dimensions), because they are only meant for visualization.
# Here I (again) arbitrarily select 10 PCs from the PCA
data.seurat <- FindNeighbors(data.seurat, dims = 1:10)
data.seurat <- FindClusters(data.seurat, resolution = 0.5) # this resolution factor can be tuned for matching the expectation/hypothesis of the user# Number of cells per cluster
table(data.seurat$seurat_clusters)
0 1 2 3 4 5 6 7
2107 1352 1150 1001 502 418 319 151
# Now the UMAP shows the cluster colors by default
DimPlot(data.seurat, reduction = "umap") # To visualize the results of the UMAPQuestion: Is there 8 clusters in our data? It’s difficult to know if it’s an optimal number for our dataset, so you can play with the resolution parameter to change its behaviour. Of note, another way to know if 2 clusters can be merged together (or not), is to check the marker genes of these clusters (next step).
We can check the top marker genes of one cluster (vs all other cells)
# Find all markers of cluster 2
cluster2.markers <- FindMarkers(data.seurat, only.pos = TRUE, ident.1 = 2, min.pct = 0.25, logfc.threshold = 0.25)
cluster2.markers <- cluster2.markers[with(cluster2.markers, order(-avg_log2FC)),] # order by fold-change
head(cluster2.markers, n = 10) # top 10 p_val avg_log2FC pct.1 pct.2 p_val_adj
Hpd 0.000000e+00 3.510422 0.422 0.036 0.000000e+00
grh 0.000000e+00 3.266423 0.879 0.152 0.000000e+00
CG34120 0.000000e+00 3.207583 0.844 0.129 0.000000e+00
CG33110 0.000000e+00 3.183478 0.560 0.055 0.000000e+00
CG3961 0.000000e+00 3.087344 0.873 0.311 0.000000e+00
BomS3 2.198969e-138 2.770804 0.281 0.056 2.818639e-134
CG11147 0.000000e+00 2.757191 0.656 0.114 0.000000e+00
vvl 0.000000e+00 2.747815 0.750 0.200 0.000000e+00
ELOVL 4.442153e-284 2.693042 0.567 0.132 5.693952e-280
CG5973 0.000000e+00 2.594608 0.813 0.212 0.000000e+00
You can also run the function for all your clusters at once:
# Find all markers for all clusters
all.markers <- FindAllMarkers(data.seurat, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)And then subset the aggregated table to study a particular cluster:
# Subset the dataset for displaying only results for cluster 2
cluster2.markers <- subset(all.markers, cluster == 2) # Subsetting, to retrieve only the ones for cluster 2
cluster2.markers <- cluster2.markers[with(cluster2.markers, order(-avg_log2FC)),] # order by fold-change
head(cluster2.markers, n = 10) # top 10 p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
Hpd 0.000000e+00 3.510422 0.422 0.036 0.000000e+00 2 Hpd
grh 0.000000e+00 3.266423 0.879 0.152 0.000000e+00 2 grh
CG34120 0.000000e+00 3.207583 0.844 0.129 0.000000e+00 2 CG34120
CG33110 0.000000e+00 3.183478 0.560 0.055 0.000000e+00 2 CG33110
CG3961 0.000000e+00 3.087344 0.873 0.311 0.000000e+00 2 CG3961
BomS3 2.198969e-138 2.770804 0.281 0.056 2.818639e-134 2 BomS3
CG11147 0.000000e+00 2.757191 0.656 0.114 0.000000e+00 2 CG11147
vvl 0.000000e+00 2.747815 0.750 0.200 0.000000e+00 2 vvl
ELOVL 4.442153e-284 2.693042 0.567 0.132 5.693952e-280 2 ELOVL
CG5973 0.000000e+00 2.594608 0.813 0.212 0.000000e+00 2 CG5973
We can then visualize the expression of the top DE genes found for cluster 2 across all clusters:
VlnPlot(data.seurat, features = c("Six4", "eya", "stg", "Gmap", "apt", "CG12535"))Or as gene expression gradient within our UMAP / t-SNE visualization:
FeaturePlot(data.seurat, features = c("Six4", "eya", "stg", "Gmap", "apt", "CG12535"), ncol = 3)This is the final and most tedious task to do, since it often requires manual check of the DE genes, and some expert knowledge:
Once all cell types are identified, we can display them on the UMAP:
new.cluster.ids <- c("Cluster 0", "Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5", "Cluster 6", "Cluster 7")
names(new.cluster.ids) <- levels(data.seurat)
data.seurat <- RenameIdents(data.seurat, new.cluster.ids)
DimPlot(data.seurat, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()Question: Now you can annotate your clusters and replace these cluster names by curated cell-types you identified.
Note 1: The DE genes found for each cluster vastly depend on the background cells (the other clusters). Which means that if you have only a subset of cell types in your dataset, you may find many DE genes that seem to be specific to your cluster of interest, simply because they are not expressed in the other clusters (but could be expressed in other cell types, if they are not present in your dataset)
Note 2: There are methods for automatic prediction of clusters based on curated marker genes database (for e.g. Garnett), but they are not yet perfect (especially for rare cell types) and don’t have curated database for all species/tissues
Note 3: There is also a growing amounf of studies that rather perform “integration” of cell atlases with their dataset, in order to see which annotated cell types their cluster would match. This requires specific integration step at the beginning of the project using Seurat integration method or external methods such as Harmony
Using this script, you should now be able to run the basic Seurat pipeline on any dataset. Of course, there is not a unique pipeline that will work on all datasets, and it requires to be adapted/tuned for each dataset specificities. But at least, this course will give you the canvas to start an analysis, and maybe go beyond, and perform more in-depth downstream analyses once you feel more secured.
Here, the tutorial is very oriented, so that you don’t need to know much R to be able to run the analysis, but if you want to go deeper, then of course you will have to learn R, or you will be quickly blocked. There are good online books from the CRAN community, or more recent one that work a bit like this tutorial. You can also attend online courses, but in any case, I would strongly recommend to follow any of these courses/tutorials if you intend to perform single-cell analyses in the future.
Of course, it is possible to run this whole analysis in R without using the Seurat framework. An online book, “Orchestrating single-cell analysis with Bioconductor”, is an excellent resource for performing this. It can also be run in Python, using the scanpy package, if you prefer Python over R.